Setting up covidData

To use the covidData package, you must first install it and download the data from its various sources. This is not as straightforward as installing a typical R package. The latest installation instructions can be found on the package’s GitHub page.

Overview of Package Functionality

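This R package provides versioned time series data for COVID-19 hospitalizations, cases, and deaths (measure). As the calls to load_data in the sections below show, data are also selected by geographic level (spatial_resolution: county, state, or national), by time scale (temporal_resolution: daily or weekly), and by the date on which the data were reported (issues).

The examples that follow use these parameters.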

Code to retrieve and plot data

The following code retrieves and plots case and death data from JHU CSSE and hospitalization data from HealthData.gov. For more details about how these data are computed and what sources they come from, please see the COVID-19 Forecast Hub Technical README file.

library(dplyr)
library(forcats)
library(ggplot2)
library(covidData)

Daily incident cases, hospitalizations and deaths at the state and national level

Data shown are the incident cases, deaths, or hospitalizations per day at the state and national levels, as reported on December 2, 2020. Each call to load_data retrieves data for a single measure, so the three measures are loaded separately.

# Load incident cases, hospitalizations, and deaths data at state and national level
uncombined_cases <- load_data(
    issues = "2020-12-02",
    spatial_resolution = c("state", "national"),
    temporal_resolution = "daily",
    measure = "cases"
  )

uncombined_deaths <- load_data(
    issues = "2020-12-02",
    spatial_resolution = c("state", "national"),
    temporal_resolution = "daily",
    measure = "deaths"
  )

uncombined_hospitalizations <- load_data(
    issues = "2020-12-02",
    spatial_resolution = c("state", "national"),
    temporal_resolution = "daily",
    measure = "hospitalizations"
  )

# View the separate data frames
tail(uncombined_cases)
## # A tibble: 6 x 4
##   location date            cum    inc
##   <chr>    <date>        <dbl>  <dbl>
## 1 US       2020-11-26 12883264 110611
## 2 US       2020-11-27 13088821 205557
## 3 US       2020-11-28 13244417 155596
## 4 US       2020-11-29 13383320 138903
## 5 US       2020-11-30 13541221 157901
## 6 US       2020-12-01 13721304 180083
tail(uncombined_deaths)
## # A tibble: 6 x 4
##   location date          cum   inc
##   <chr>    <date>      <dbl> <dbl>
## 1 US       2020-11-26 263454  1232
## 2 US       2020-11-27 264858  1404
## 3 US       2020-11-28 266047  1189
## 4 US       2020-11-29 266873   826
## 5 US       2020-11-30 268045  1172
## 6 US       2020-12-01 270642  2597
tail(uncombined_hospitalizations)
## # A tibble: 6 x 4
##   date         inc location   cum
##   <date>     <dbl> <chr>    <dbl>
## 1 2020-11-26 11302 US          NA
## 2 2020-11-27 12971 US          NA
## 3 2020-11-28 12471 US          NA
## 4 2020-11-29 12337 US          NA
## 5 2020-11-30 14116 US          NA
## 6 2020-12-01 14135 US          NA

Because each call returns a single measure, we bind the results of the three calls together into one data frame for plotting and add a measure column that states whether each row represents cases, deaths, or hospitalizations. In the resulting data frame, the date column gives the day the data correspond to, cum gives the cumulative count of the measure, and inc gives its daily or weekly incidence. In the output above, cum is NA for hospitalizations. The location column in the output from load_data identifies locations by their FIPS codes, which are alphanumeric codes that uniquely identify locations. More human-readable location names are contained in the fips_codes data frame provided by the covidData package and are joined to the data set.
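As a quick check on how cum and inc relate, the sketch below re-enters by hand the six national case rows printed above (it does not call load_data) and confirms that each inc value equals the change in cum from the previous day. The name us_cases is illustrative only.

```r
# National case counts copied by hand from the tail(uncombined_cases) output
us_cases <- data.frame(
  date = as.Date("2020-11-26") + 0:5,
  cum  = c(12883264, 13088821, 13244417, 13383320, 13541221, 13721304),
  inc  = c(110611, 205557, 155596, 138903, 157901, 180083)
)

# Day-over-day change in the cumulative count
daily_change <- diff(us_cases$cum)

# inc for a given day should match the change from the previous day's cum;
# the first row is skipped because its previous day is not in this excerpt
cum_matches_inc <- all(daily_change == us_cases$inc[-1])
print(cum_matches_inc)
```

For these rows the check holds, which is consistent with inc being the day-over-day difference of cum for daily data.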

# Bind incident cases, hospitalizations, and deaths data at state and national level
combined_data <- dplyr::bind_rows(
  uncombined_cases %>%
    dplyr::mutate(measure = "cases"),
  uncombined_deaths %>%
    dplyr::mutate(measure = "deaths"),
  uncombined_hospitalizations %>%
    dplyr::mutate(measure = "hospitalizations")
)
head(combined_data)
## # A tibble: 6 x 5
##   location date         cum   inc measure
##   <chr>    <date>     <dbl> <dbl> <chr>  
## 1 01       2020-01-22     0     0 cases  
## 2 01       2020-01-23     0     0 cases  
## 3 01       2020-01-24     0     0 cases  
## 4 01       2020-01-25     0     0 cases  
## 5 01       2020-01-26     0     0 cases  
## 6 01       2020-01-27     0     0 cases
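In the tail output above, the hospitalization data frame lists its columns in a different order (date, inc, location, cum) from the cases and deaths data frames. The following minimal sketch, using made-up toy values rather than real load_data results, illustrates that dplyr::bind_rows matches columns by name, so the differing order does not matter when stacking.

```r
library(dplyr)

# Toy stand-ins for two load_data results (values are made up);
# the second uses the column order seen in the hospitalization output
cases_toy <- tibble(
  location = "US",
  date = as.Date("2020-12-01"),
  cum = 100,
  inc = 10
)
hosp_toy <- tibble(
  date = as.Date("2020-12-01"),
  inc = 5,
  location = "US",
  cum = NA_real_
)

# bind_rows() aligns columns by name, not by position
stacked <- bind_rows(
  cases_toy %>% mutate(measure = "cases"),
  hosp_toy %>% mutate(measure = "hospitalizations")
)
print(stacked)
```

The stacked result takes its column order from the first data frame, which matches the combined_data output shown above.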
# Add more human readable location names,
# set location abbreviation as a factor with US first
combined_data <- combined_data %>%
  dplyr::left_join(
    covidData::fips_codes,
    by = "location"
  ) %>%
  dplyr::mutate(
    abbreviation = forcats::fct_relevel(factor(abbreviation), "US")
  )
head(combined_data)
## # A tibble: 6 x 8
##   location date         cum   inc measure location_name location_name_with_s… abbreviation
##   <chr>    <date>     <dbl> <dbl> <chr>   <chr>         <chr>                 <fct>       
## 1 01       2020-01-22     0     0 cases   Alabama       Alabama               AL          
## 2 01       2020-01-23     0     0 cases   Alabama       Alabama               AL          
## 3 01       2020-01-24     0     0 cases   Alabama       Alabama               AL          
## 4 01       2020-01-25     0     0 cases   Alabama       Alabama               AL          
## 5 01       2020-01-26     0     0 cases   Alabama       Alabama               AL          
## 6 01       2020-01-27     0     0 cases   Alabama       Alabama               AL
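The reason for making abbreviation a factor with US first is that ggplot2 orders facets by factor level, and factor levels default to sorted order. The sketch below shows the effect on a few illustrative abbreviations. It uses base R only as a stand-in for forcats::fct_relevel, so it is an approximation of the step above rather than the same call.

```r
# Abbreviations in an arbitrary order (illustrative values)
abbr <- factor(c("MA", "US", "AL", "MA"))

# Default levels are sorted, so "US" comes last here
default_levels <- levels(abbr)
print(default_levels)

# Move "US" to the front while keeping the remaining levels in their order
abbr_us_first <- factor(abbr, levels = c("US", setdiff(levels(abbr), "US")))
print(levels(abbr_us_first))
```

Only the level order changes; the underlying values stay the same, so the national panel is drawn first and the states follow.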

The combined data frame is then passed to ggplot to produce the graph below, which shows cases, deaths, and hospitalizations nationally and for each state.

# Plot the data
ggplot(
  data = combined_data,
  # data = filter(combined_data, abbreviation %in% c("MA", "SD", "TX")),
  mapping = aes(x = date, y = inc, color = measure)
) +
  geom_smooth(se = FALSE, span = .25) +
  geom_point(alpha = .2) +
  facet_wrap(~abbreviation, ncol = 3, scales = "free_y") +
  scale_y_log10() +
  # scale_x_date(limits=c(as.Date("2020-07-01"), Sys.Date())) +
  theme_bw()

Daily cumulative deaths nationally and for select states

Using the same data set as above, we can also look at cumulative deaths at the state and national levels. The data set is first filtered to keep only the rows for deaths, and then filtered by abbreviation to keep only the national data and the state data for Georgia, New York, and Massachusetts. Since we are interested in cumulative deaths, we plot the cum column.

# Plot cumulative deaths at the state and national level
combined_data %>%
  filter(measure == "deaths") %>%
  filter(abbreviation %in% c("US", "NY", "GA", "MA")) %>%
  ggplot(
    mapping = aes(x = date, y = cum)
  ) +
  geom_point(alpha = .2) +
  facet_wrap(~abbreviation, ncol = 2, scales = "free_y") +
  #scale_y_log10() +
  theme_bw()

County level incident cases and deaths for the 14 counties in Massachusetts

Data shown are the incident cases and deaths per day at the county level, as reported on January 1, 2021. We do not have county-level hospitalization data, so this data set contains only data from JHU CSSE.

Note that JHU does not report data for Nantucket County or Dukes County in Massachusetts, so the plots for these counties will be empty even though both counties had some cases.

# Load incident cases, and deaths data at county level
county_data <- dplyr::bind_rows(
  load_data(
    issues = "2021-01-01",
    spatial_resolution = "county",
    temporal_resolution = "daily",
    measure = "cases"
  ) %>%
    dplyr::mutate(measure = "cases"),
  load_data(
    issues = "2021-01-01",
    spatial_resolution = "county",
    temporal_resolution = "daily",
    measure = "deaths"
  ) %>%
    dplyr::mutate(measure = "deaths")
)
# Add more human readable location names,
# set location abbreviation as a factor with US first
county_data <- county_data %>%
  dplyr::left_join(
    covidData::fips_codes,
    by = "location"
  ) %>%
  dplyr::mutate(
    abbreviation = forcats::fct_relevel(factor(abbreviation), "US")
  )

Every county FIPS code in Massachusetts consists of the state prefix 25 followed by three digits identifying the county, so the data are filtered to include only counties whose FIPS codes fall in the 25000s.

# Look at county level data for Massachusetts
county_data %>%
  dplyr::filter(startsWith(location, "25")) %>% # MA county FIPS codes start with 25
  ggplot(
    mapping = aes(x = date, y = inc, color = measure)
  ) +
  geom_smooth(se = FALSE, span = .25) +
  geom_point(alpha = .2) +
  facet_wrap(~location_name, ncol = 3, scales = "free_y") +
  scale_y_log10() +
  theme_bw()
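Because location is a character column, comparing it with numbers relies on R implicitly converting the numbers to strings. A prefix test states the intent more directly. The sketch below tries it on a few hand-written five-character codes that are not taken from the package data.

```r
# Hand-written five-character FIPS codes, stored as strings
# like the location column
fips <- c("01001", "25001", "25027", "26001", "06037")

# Keep codes whose first two characters are the Massachusetts state prefix
is_ma <- startsWith(fips, "25")
ma_fips <- fips[is_ma]
print(ma_fips)
```

Codes with leading zeros, such as "01001", are handled correctly because nothing is converted to a number.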

Weekly incident cases, hospitalizations, and deaths for select states

Data shown are the incident cases, deaths, or hospitalizations per week at the state level, as reported on December 31, 2020. National data can be loaded the same way. Using weekly data reduces the noise in the graphs.

# Load weekly incident cases, hospitalizations, and deaths data at state level
weekly_data <- dplyr::bind_rows(
  load_data(
    issues = "2020-12-31",
    spatial_resolution = c("state"),
    temporal_resolution = "weekly",
    measure = "cases"
  ) %>%
    dplyr::mutate(measure = "cases"),
  load_data(
    issues = "2020-12-31",
    spatial_resolution = c("state"),
    temporal_resolution = "weekly",
    measure = "deaths"
  ) %>%
    dplyr::mutate(measure = "deaths"),
  load_data(
    issues = "2020-12-31",
    spatial_resolution = c("state"),
    temporal_resolution = "weekly",
    measure = "hospitalizations"
  ) %>%
    dplyr::mutate(measure = "hospitalizations")
)
# Add more human readable location names,
# set location abbreviation as a factor 
weekly_data <- weekly_data %>%
  dplyr::left_join(
    covidData::fips_codes,
    by = "location"
  ) %>%
  dplyr::mutate(
    abbreviation = forcats::fct_relevel(factor(abbreviation))
  )

The weekly data set is filtered to show only the data for Maine, Maryland, Massachusetts, and Michigan in the plot.

# Plot the data
ggplot(
  data = filter(weekly_data, abbreviation %in% c("ME", "MD", "MA", "MI")),
  mapping = aes(x = date, y = inc, color = measure)
) +
  geom_smooth(se = FALSE, span = .25) +
  geom_point(alpha = .2) +
  facet_wrap(~location_name, ncol = 2, scales = "free_y") +
  scale_y_log10() +
  theme_bw()
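One reason weekly data look less noisy is that summing over seven days averages out day-of-week reporting patterns. The sketch below illustrates this with synthetic daily counts. The 7-day blocking is an assumption made for illustration, and the package's own definition of a week may differ.

```r
# Synthetic daily counts for three weeks with a made-up
# day-of-week reporting pattern (lower counts on the last two days)
daily_inc <- rep(c(100, 120, 130, 125, 110, 60, 40), times = 3)

# Assign each day to a consecutive 7-day block (illustrative assumption)
week_id <- (seq_along(daily_inc) - 1) %/% 7

# Sum daily incidence within each block
weekly_inc <- tapply(daily_inc, week_id, sum)

print(range(daily_inc))  # daily values swing with the reporting pattern
print(weekly_inc)        # weekly totals are steady in this toy series
```

In this toy series the daily values vary widely while every weekly total is the same, which is the smoothing effect seen in the weekly plots.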

View discrepancies between incident deaths in New Jersey

On August 2, 2020, New Jersey revised its previously reported incident deaths. Loading the data with two different issue dates, August 1, 2020 and August 2, 2020, therefore shows differing values. The data are loaded via load_data into one data frame per issue date and filtered to include only the daily deaths for New Jersey. An extra column called issue_date is added to each data set to record the issue date of every entry. The two data sets are then combined into one larger data set for plotting, where the issue_date column identifies which issue date each row comes from.

# Comparison of incident deaths reported in NJ between
# "2020-08-01" and "2020-08-02"

NJ_issue_day_1 <- load_data(
  issues = "2020-08-01",
  spatial_resolution = c("state"),
  temporal_resolution = "daily",
  measure = "deaths"
) %>%
  dplyr::left_join(
    covidData::fips_codes,
    by = "location"
  ) %>%
  dplyr::filter(abbreviation == "NJ") %>%
  dplyr::mutate(issue_date = "2020-08-01")

NJ_issue_day_2 <- load_data(
  issues = "2020-08-02",
  spatial_resolution = c("state"),
  temporal_resolution = "daily",
  measure = "deaths"
) %>%
  dplyr::left_join(
    covidData::fips_codes,
    by = "location"
  ) %>%
  dplyr::filter(abbreviation == "NJ") %>%
  dplyr::mutate(issue_date = "2020-08-02")

# Stack the two issues; issue_date identifies which issue each row came from
NJ_issue_date_comparison <- dplyr::bind_rows(NJ_issue_day_1, NJ_issue_day_2)

To illustrate the importance of the issue date, the daily incident deaths in New Jersey are plotted for each of these two consecutive issue dates.

# Plot the differences in daily incidence between 2 consecutive issue dates in New Jersey
ggplot(
  data = NJ_issue_date_comparison,
  mapping = aes(x = date, y = inc, color = issue_date)
) +
  geom_smooth(se = FALSE, span = .25) +
  geom_point(alpha = .2) +
  scale_y_log10() +
  theme_bw()
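Besides plotting, the size of a revision can be quantified by aligning the two issues on date and subtracting. The sketch below does this with synthetic values that are not New Jersey's actual data. The data frame and column names are illustrative, and base R merge is used so that the example has no package dependencies.

```r
# Synthetic incident deaths for the same dates under two issue dates
# (values are made up for illustration)
issue_1 <- data.frame(
  date = as.Date("2020-07-28") + 0:2,
  inc = c(10, 12, 8)
)
issue_2 <- data.frame(
  date = as.Date("2020-07-28") + 0:3,
  inc = c(10, 15, 8, 11)
)

# Align the two versions by date; all = TRUE keeps dates that
# appear in only one of the issues
revisions <- merge(
  issue_1, issue_2,
  by = "date", all = TRUE,
  suffixes = c("_issue_1", "_issue_2")
)

# Positive values mean the later issue reports more deaths for that date;
# NA means the date is missing from one of the issues
revisions$revision <- revisions$inc_issue_2 - revisions$inc_issue_1
print(revisions)
```

Rows with a nonzero revision are the dates whose reported values changed between the two issues.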